perm filename CHAP4[4,KMC]9 blob
sn#022485 filedate 1973-01-30 generic text, type T, neo UTF8
CHAPTER FOUR

SPECIAL PROBLEMS FOR COMPUTER UNDERSTANDING OF NATURAL LANGUAGE
IN TELETYPED PSYCHIATRIC INTERVIEWS

By `natural language' I shall mean everyday American English such as
is used by readers of this book in ordinary conversations. It is
still difficult to be explicit about the processes which enable
humans to interpret and respond to natural language. Philosophers,
linguists and psychologists have investigated natural language with
various purposes and few useful results. Now attempts are being made
in artificial intelligence to write algorithms which `understand'
natural language expressions.
     During the 1960's, when machine processing of natural language
was dominated by syntactic considerations, it became clear that
syntactic information alone was insufficient to comprehend the
expressions of ordinary conversations. The current view is that to
understand what is said in linguistic expressions, syntax and
semantics must be combined with beliefs from an underlying conceptual
structure having an ability to draw inferences. How to achieve this
combination efficiently with a large data-base represents a
monumental task for both theory and implementation.
     Since the behavior being simulated by our paranoid model is the
language-behavior of a paranoid patient in a psychiatric interview,
the model must have an ability to interpret and respond to natural
language input sufficient only to demonstrate language-behavior
characteristic of the paranoid mode. How language is understood
depends on the intentions of the producers and interpreters in the
dialogue. Thus language is understood in accordance with the
participants' view of the game being played. Our purpose was to
develop a method for understanding everyday English sufficient for
the model to communicate linguistically in a paranoid way in the
circumscribed situation of a psychiatric interview. We did not try
to construct a general-purpose algorithm which could understand
anything said in English by anybody to anybody in any dialogue
situation. (Does anyone believe it possible?) We took as a pragmatic
measure of "understanding" the ability of the algorithm to `get the
message' of an expression by trying to classify the imperative or
directive intent of the interviewer, i.e. what effect he is trying to
bring about in the interviewee relative to the topic. This
straightforward approach to a complex problem has its drawbacks, as
will be shown, but we strove for a highly individualized idiolect
sufficient to demonstrate paranoid processes of an individual in a
particular situation rather than for a general supra-individual or
ideal comprehension of English. If the language-understanding
algorithm interfered with demonstrating the paranoid processes, we
would consider it defective and insufficient for our purposes.
(Insert from Machr here)
     Some special problems a dialogue algorithm must cope with in a
psychiatric interview will now be discussed.

QUESTIONS

     The principal sentence-type used by an interviewer is the
question. The usual wh- and yes-no questions must be recognized by
the language-algorithm. In teletyped interviews a question may
sometimes be put in declarative form followed by a question mark, as
in:
     (1) PT.- I LIKE TO GAMBLE ON THE HORSES.
         DR.- YOU GAMBLE?
Particularly difficult are `when' questions, which require a memory
that can assign each event a beginning, an end and a duration. Also
troublesome are questions such as `how often' and `how many', i.e. a
`how' followed by a quantifier.
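The recognition of these sentence-types can be sketched with a small classifier keyed to the first word and the terminal punctuation, covering wh- questions, yes-no questions (including declaratives followed by a question mark), and imperatives. This is our illustrative sketch, not the model's actual mechanism; the word lists are invented and far from complete.

```python
# Sketch of sentence-type recognition. Word lists are a small
# invented sample, not the model's actual tables.
WH_WORDS = {"WHO", "WHAT", "WHEN", "WHERE", "WHY", "HOW", "WHICH"}
AUX_VERBS = {"DO", "DOES", "DID", "ARE", "IS", "WAS", "WERE",
             "HAVE", "HAS", "CAN", "COULD", "WOULD", "WILL"}
IMPERATIVE_LEADS = {"TELL", "DESCRIBE", "LETS", "EXPLAIN"}

def sentence_type(expr):
    words = expr.strip().rstrip("?.!").split()
    if not words:
        return "SIGNAL"
    first = words[0].upper()
    if first in WH_WORDS:
        return "WH-QUESTION"
    if first in AUX_VERBS:
        return "YES-NO QUESTION"
    if expr.strip().endswith("?"):        # declarative form plus "?"
        return "YES-NO QUESTION"
    if first in IMPERATIVE_LEADS:
        return "IMPERATIVE"               # treated as a request for information
    return "DECLARATIVE"
```

A declarative such as (1) followed by "YOU GAMBLE?" is caught by the terminal question mark even though no question word begins it.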
     In constructing a simulation of a thought process it is
arbitrary how much information to represent in memory. Should the
model know the capital of Alabama? It is trivial to store a large
number of facts. We took the position that the model should know
only what we believed it reasonable to know relevant to a few
hundred topics expectable in a psychiatric interview. Thus the model
performs badly when subjected to baiting `exam' questions designed to
test its informational limitations rather than to seek useful
psychiatric information.

IMPERATIVES

     Typical imperatives in a psychiatric interview consist of
expressions like:
     (2) DR.- TELL ME ABOUT YOURSELF.
     (3) DR.- LETS DISCUSS YOUR FAMILY.
Such imperatives are actually interrogatives to the interviewee about
the topics they refer to. Since the only physical action the model
can perform is to `talk', imperatives should be treated as requests
for information.

DECLARATIVES

     In this category we lump everything else. It includes
greetings, farewells, yes-no type answers, existence assertions and
predications made upon a subject.

AMBIGUITIES

     Words have more than one sense, a convenience for human memories
but a struggle for language-analyzing algorithms. Consider the word
`bug' in the following expressions:
     (4) AM I BUGGING YOU?
     (5) AFTER A PERIOD OF HEAVY DRINKING HAVE YOU FELT BUGS ON
         YOUR SKIN?
     (6) DO YOU THINK THEY PUT A BUG IN YOUR ROOM?
In expression (4) the term `bug' means to annoy, in (5) it refers to
an insect and in (6) it refers to a microphone used for hidden
surveillance. Some common words like `run' have fifty or more
senses. Context must be used to carry out disambiguation, as
described in 00.0. We also have the advantage of an idiolect in
which we can arbitrarily restrict the word senses. One
characteristic of the paranoid mode is that no matter in what sense
the interviewer uses a word, the patient may idiosyncratically
interpret it in some sense relevant to his pathological beliefs of
malevolence.
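Context-driven sense selection for an idiolect can be sketched by giving each sense of an ambiguous word a hand-picked set of context words that favor it, and choosing the sense with the largest overlap. The sense inventory below is our own invention for `bug', not the model's actual dictionary.

```python
# Sketch of sense selection by context overlap. Cue-word sets are
# invented for illustration; senses (4)-(6) motivate the entries.
SENSES = {
    "BUG": {
        "INSECT":     {"SKIN", "FELT", "DRINKING", "CRAWLING"},
        "MICROPHONE": {"ROOM", "PUT", "PHONE", "LISTENING"},
    },
}

def select_sense(word, expression):
    context = set(expression.upper().replace("?", "").split())
    senses = SENSES.get(word.upper())
    if senses is None:
        return None
    # Pick the sense whose cue words overlap the context most.
    return max(senses, key=lambda s: len(senses[s] & context))
```

The verb sense in (4) would be handled earlier by pattern recognition of "BUGGING"; the sketch covers only the noun readings.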

ANAPHORIC REFERENCES

     The common anaphoric references consist of the pronouns `it',
`he', `him', `she', `her', `they' and `them', as in:
     (7) PT.- HORSERACING IS MY HOBBY.
     (8) DR.- WHAT DO YOU ENJOY ABOUT IT?
The algorithm must recognize that the `it' refers to `horseracing'.
More difficult is a reference more than one I/O pair back in the
dialogue, as in:
     (9) PT.- THE MAFIA IS OUT TO GET ME.
     (10) DR.- ARE YOU AFRAID OF THEM?
     (11) PT.- MAYBE.
     (12) DR.- WHY IS THAT?
The "that" of expression (12) does not refer to (11) but to the topic
of being afraid which the interviewer introduced in (10). Another
pronominal confusion occurs when the interviewer uses "we" in two
senses, as in:
     (13) DR.- WE WANT YOU TO STAY IN THE HOSPITAL.
     (14) PT.- I WANT TO BE DISCHARGED NOW.
     (15) DR.- WE ARE NOT COMMUNICATING.
In expression (13) the interviewer is using "we" to refer to the
psychiatrists or the hospital staff, while in (15) the term refers to
the interviewer and the patient.
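Resolution of such pronouns can be sketched as a backwards search through a stack of recently mentioned topics, filtered by number agreement, so that a reference can reach back past the immediately preceding I/O pair. The class and topic entries below are ours, for illustration only.

```python
# Sketch of pronoun resolution against recently mentioned topics.
class TopicMemory:
    def __init__(self):
        self.topics = []                  # most recent last: (topic, number)

    def mention(self, topic, number):
        self.topics.append((topic, number))

    def resolve(self, pronoun):
        wanted = "PLURAL" if pronoun.upper() in {"THEY", "THEM"} else "SINGULAR"
        # Search backwards for the most recent topic agreeing in number.
        for topic, number in reversed(self.topics):
            if number == wanted:
                return topic
        return None

memory = TopicMemory()
memory.mention("HORSERACING", "SINGULAR")
memory.mention("THE MAFIA", "PLURAL")
```

With both topics on the stack, "THEM" in (10) finds THE MAFIA while an "IT" would still reach back to HORSERACING.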

TOPIC SHIFTS

     In the main a psychiatric interviewer is in control of the
interview. When he has gained sufficient information about a topic,
he shifts to a new topic. Naturally the algorithm must detect this
change of topic, as in the following:
     (16) DR.- HOW DO YOU LIKE THE HOSPITAL?
     (17) PT.- ITS NOT HELPING ME TO BE HERE.
     (18) DR.- WHAT BROUGHT YOU TO THE HOSPITAL?
     (19) PT.- I AM VERY UPSET AND NERVOUS.
     (20) DR.- WHAT TENDS TO MAKE YOU NERVOUS?
     (22) PT.- JUST BEING AROUND PEOPLE.
     (23) DR.- ANYONE IN PARTICULAR?
In (16) and (18) the topic is the hospital. In (20) the topic
changes to causes of the patient's nervous state. When a topic is
introduced by the patient, as in (19), a number of things can be
expected to be asked about it. Thus the algorithm can have ready an
expectancy-anaphora list which allows it to determine whether the
topic introduced by the model is being responded to or whether the
interviewer is continuing with the previous topic. Topics touched
upon previously can be re-introduced at any point in the interview.
The memory of the model is responsible for knowing what has been
discussed.
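The expectancy idea can be sketched as follows: when a topic is introduced, a set of words the interviewer is likely to use if he pursues it is made ready, and an input sharing none of them is taken as a topic shift. The topic names and word lists below are our own invention.

```python
# Sketch of an expectancy list for topic-shift detection.
# Word sets are illustrative, not the model's actual lists.
EXPECTANCIES = {
    "NERVOUSNESS": {"NERVOUS", "UPSET", "ANXIOUS", "AFRAID", "TENSE"},
    "HOSPITAL":    {"HOSPITAL", "WARD", "STAFF", "DOCTORS"},
}

def pursues_topic(topic, expression):
    """True if the input shares any word with the topic's expectancy set."""
    words = set(expression.upper().rstrip("?.!").split())
    return bool(EXPECTANCIES.get(topic, set()) & words)
```

After (19) introduces nervousness, (20) matches that topic's expectancies, while a return to (16)'s wording would match the hospital list instead.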

META-REFERENCES

     These are references not about a topic directly but about what
has been said about the topic, as in:
     (24) DR.- WHY ARE YOU IN THE HOSPITAL?
     (25) PT.- I SHOULDNT BE HERE.
     (26) DR.- WHY DO YOU SAY THAT?
The expression (26) is about, and meta to, expression (25).
     Sometimes when the patient makes a statement, the doctor
replies, not with a question, but with another statement which
constitutes a rejoinder, as in:
     (27) PT.- I HAVE LOST A LOT OF MONEY GAMBLING.
     (28) DR.- I GAMBLE QUITE A BIT ALSO.
Here the algorithm should interpret (28) as a directive to continue
discussing gambling, not as an indication to question the doctor
about gambling. The one exception to this principle occurs when the
algorithm recognizes a chance to add to its model or representation
of the interviewer.

ELLIPSES

     In dialogues one finds many ellipses, expressions from which one
or more words are omitted, as in:
     (29) PT.- I SHOULDNT BE HERE.
     (30) DR.- WHY NOT?
Here the complete construction must be understood as:
     (31) DR.- WHY SHOULD YOU NOT BE HERE?
By saving the previous surface expression and the belief it mapped
into in memory, the algorithm can recognize either what the missing
words are or the concepts they refer to.
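The expansion of (30) from the saved surface expression (29) can be sketched with a few rewrite rules: swap pronouns to the interviewer's point of view, split the negative contraction, and front the modal. The rule tables are illustrative, not the model's actual mechanism.

```python
# Sketch of elliptical "WHY NOT?" completion from the saved previous
# utterance, as in (29)-(31). Rule tables are invented samples.
CONTRACTIONS = {"SHOULDNT": ("SHOULD", "NOT"), "CANT": ("CAN", "NOT"),
                "DONT": ("DO", "NOT"), "WONT": ("WILL", "NOT")}
PRONOUN_SWAP = {"I": "YOU", "ME": "YOU", "MY": "YOUR"}

def expand_why_not(previous):
    words = [PRONOUN_SWAP.get(w, w) for w in previous.rstrip("?.!").split()]
    for i, w in enumerate(words):
        if w in CONTRACTIONS:
            modal, neg = CONTRACTIONS[w]
            subject = words[:i]           # e.g. ["YOU"]
            rest = words[i + 1:]          # e.g. ["BE", "HERE"]
            return " ".join(["WHY", modal] + subject + [neg] + rest) + "?"
    return "WHY NOT?"                     # give up: echo the fragment
```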
     The opposite of ellipsis is redundancy, which usually presents
no problem since the same thing is being said more than once, as in:
     (32) DR.- LET ME ASK YOU A QUESTION.
If an analysis were required of this expression (it is not required
here since the expression is a stereotype), it would be recognized
that the verb "ask" takes the noun "question" as direct object and
also that a question is something that is asked.

SIGNALS

     Some fragmentary expressions serve only as directive signals to
proceed, as in:
     (33) PT.- I WENT TO THE TRACK LAST WEEK.
     (34) DR.- AND?
The fragment of (34) requests a continuation of the story introduced
in (33). The common expressions found in interviews are "and", "so",
"go on", "go ahead", "really", etc. If an input expression cannot be
recognized at all, the lowest-level default condition is to assume it
is a signal and either proceed with the next line in a story under
discussion or, if there is no story in progress, begin a new story
with a prompting question or statement.
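This lowest-level default can be sketched as a signal test plus a story pointer: a recognized signal (or any unrecognized input) yields the next line of the current story, or a prompt for a new one when the story is exhausted. The story lines and prompt are invented for illustration.

```python
# Sketch of the signal/default condition. Story text is invented.
SIGNALS = {"AND", "SO", "GO ON", "GO AHEAD", "REALLY"}

def is_signal(expression):
    return expression.upper().rstrip("?.!") in SIGNALS

def proceed(story, position):
    """Return (reply, new_position): continue the story or start anew."""
    if position < len(story):
        return story[position], position + 1        # next line of the story
    return "HAVE I TOLD YOU ABOUT THE BOOKIES?", position  # prompt a new story

story = ["I WENT TO THE TRACK LAST WEEK.",
         "I LOST A LOT OF MONEY GAMBLING."]
```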
     This strategy can fail, as in:
     (FIND GOOD EXAMPLE)

IDIOMS

     Since so much of conversational language is stereotyped, the
task of recognition is much easier than that of analysis. This is
particularly true of idioms. Either one knows what an idiom means or
one does not. It is usually hopeless to try to decipher what an
idiom means from an analysis of its constituent parts. If the reader
doubts this, let him ponder the following expressions taken from
actual teletyped interviews:
     (35) DR.- WHATS EATING YOU?
     (36) DR.- YOU SOUND KIND OF PISSED OFF.
     (37) DR.- WHAT ARE YOU DRIVING AT?
     (38) DR.- ARE YOU PUTTING ME ON?
     (39) DR.- WHY ARE THEY AFTER YOU?
     (40) DR.- HOW DO YOU GET ALONG WITH THE OTHER PATIENTS?
     (41) DR.- HOW DO YOU LIKE YOUR WORK?
     (42) DR.- HAVE THEY TRIED TO GET EVEN WITH YOU?
     (43) DR.- I CANT KEEP UP WITH YOU.
Understanding idioms is a matter of rote memory. Hence an algorithm
with a large idiom table is required. As each new idiom appears in
teletyped interviews, it should be added to the idiom table, because
what happens once can happen again.
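An idiom table of this rote kind can be sketched as a direct map from idiom patterns to the concepts they convey, consulted longest-match-first so that no analysis of the parts is ever attempted. The handful of entries below is an invented sample, not the model's actual table.

```python
# Sketch of a rote idiom table: pattern -> conveyed meaning.
# Entries are a small invented sample.
IDIOMS = {
    "WHATS EATING YOU": "WHAT IS TROUBLING YOU",
    "ARE YOU PUTTING ME ON": "ARE YOU DECEIVING ME",
    "OUT TO GET": "INTEND TO HARM",
    "GET EVEN WITH": "RETALIATE AGAINST",
}

def idiom_lookup(expression):
    text = expression.upper().rstrip("?.!")
    # Longest idioms first, so longer matches win over embedded shorter ones.
    for idiom in sorted(IDIOMS, key=len, reverse=True):
        if idiom in text:
            return IDIOMS[idiom]
    return None
```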
     One advantage in constructing an idiolect for a model is that it
understands its own idiomatic expressions, which tend to be used by
the interviewer if he understands them, as in:
     (44) PT.- THEY ARE OUT TO GET ME.
     (45) DR.- WHAT MAKES YOU THINK THEY ARE OUT TO GET YOU?
The expression (45) is really a double idiom in which "out" means
`intend' and "get" means `harm' in this context. Needless to say, an
algorithm which tried to pair off the various meanings of "out" with
the various meanings of "get" would have a hard time of it. But an
algorithm which understands what it itself is capable of saying
should be able to recognize echoed idioms.

FUZZ TERMS

     In this category we group a large number of expressions which
have little or no meaning and therefore can be ignored by the
algorithm. The lower-case expressions in the following are examples
of fuzz:
     (46) DR.- well now perhaps YOU CAN TELL ME something ABOUT YOUR
          FAMILY.
     (47) DR.- on the other hand I AM INTERESTED IN YOU.
     (48) DR.- hey I ASKED YOU A QUESTION.
It is not the case that in order to ignore something one must
recognize explicitly what is ignorable. Since pattern-matching
allows for an `anything' slot in many of its patterns, fuzz is easily
ignored.
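The `anything' slot can be sketched as follows: a pattern is a sequence of required words, and any unlisted words between them are skipped without ever being recognized. The patterns themselves are illustrative, not the model's.

```python
# Sketch of pattern-matching with an `anything' slot: words not in the
# pattern are skipped as fuzz. Patterns here are invented examples.
def matches(pattern, expression):
    words = expression.upper().rstrip("?.!").split()
    i = 0
    for required in pattern:
        while i < len(words) and words[i] != required:
            i += 1                        # fuzz: skip anything not required
        if i == len(words):
            return False                  # a required word never appeared
        i += 1
    return True
```

Applied to (46), the pattern TELL ME ... FAMILY matches while "well now perhaps" and "something" are passed over unexamined.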

SUBORDINATE CLAUSES

     A subordinate clause is a complete statement inside another
statement. It is most frequently introduced by a relative pronoun,
indicated in the following expressions by lower case:
     (49) DR.- WAS IT THE UNDERWORLD that PUT YOU HERE?
     (50) DR.- WHO ARE THE PEOPLE who UPSET YOU?
     (51) DR.- HAS ANYTHING HAPPENED which YOU DONT UNDERSTAND?
The words "whether" and "because" serving as conjunctions are less
frequent. A language-algorithm must also recognize that subordinate
clauses can function as nouns, adjectives, adverbs, and objects of
prepositions.

VOCABULARY

     How many words should there be in the algorithm's vocabulary?
It is a rare human speaker of English who can recognize 40% of the
415,000 words in the Oxford English Dictionary. In his everyday
conversation an educated person uses perhaps 10,000 words and has a
recognition vocabulary of about 50,000 words. A study of phone
conversations showed that 96% of the talk employed only 737 words.
Of course the remaining 4%, if not recognized, may be ruinous to the
continuity of a conversation.
     In counting the words in 53 teletyped psychiatric interviews, we
found psychiatrists used only 721 words. Since we are familiar with
psychiatric vocabularies and styles of expression, we believed this
language-algorithm could function adequately with a vocabulary of a
few thousand words. There will always be unrecognized words. The
algorithm must be able to continue even if it does not have a
particular word in its vocabulary. This provision represents one
great advantage of pattern-matching over conventional linguistic
parsing.
     It is not the number of words which creates difficulties but
their combinations. One thousand factorial is still a very large
number. Syntactic and semantic constraints in stereotypes and in
analysis reduce this number, but it remains indefinitely large.
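The kind of word count reported above can be sketched by ranking words by frequency and taking the smallest set that covers a target fraction of the talk. The tiny corpus in the test is invented; the real counts came from the 53 interviews.

```python
# Sketch of deriving a core vocabulary from transcripts by frequency
# coverage, in the spirit of the counts above.
from collections import Counter

def core_vocabulary(utterances, coverage=0.96):
    counts = Counter(w for u in utterances
                     for w in u.upper().rstrip("?.!").split())
    total = sum(counts.values())
    vocab, covered = [], 0
    for word, n in counts.most_common():
        vocab.append(word)
        covered += n
        if covered / total >= coverage:   # enough of the talk is covered
            break
    return vocab
```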

MISSPELLINGS AND EXTRA CHARACTERS

     There is really no good defense against misspellings in a
teletyped interview except having a human monitor retype the correct
versions. Spelling-correcting programs are slow, inefficient and
imperfect. They experience great problems when it is the first
character in a word which is incorrect.
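The first-character problem can be illustrated with a sketch of a simple corrector that, like many of the period, trusts the first letter to index its candidates: it repairs an interior error but is blind to "TAMBLING" for "GAMBLING". The vocabulary is an invented sample.

```python
# Sketch of why first-character errors defeat simple correctors.
# Vocabulary is an invented sample.
VOCABULARY = {"GAMBLING", "HOSPITAL", "NERVOUS", "HORSES"}

def one_edit_apart(a, b):
    """True if a and b differ by one substitution or one deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))

def correct(word):
    w = word.upper()
    if w in VOCABULARY:
        return w
    # Candidates are indexed by first letter -- the weak point.
    for candidate in VOCABULARY:
        if candidate[0] == w[0] and one_edit_apart(candidate, w):
            return candidate
    return None
```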
     Extra characters sent by the interviewer or by a bad phone line
can be removed by a human monitor.

META VERBS

     Certain common verbs such as "think", "feel" and "believe" take
as their object a clause, as in:
     (53) DR.- I THINK YOU ARE RIGHT.
     (54) DR.- WHY DO YOU FEEL THE GAMBLING IS CROOKED?
The verb "believe" is peculiar since it can also take as object a
noun or noun phrase, as in:
     (55) DR.- I BELIEVE YOU.
In expression (54) the conjunction "that" can follow the word "feel",
signifying a subordinate clause. This is not the case after
"believe" in expression (55).

ODD WORDS

     These are words which are odd in the context of a teletyped
interview while they are quite natural in the usual vis-a-vis
interview in which the participants communicate through speech. This
should be clear from the following examples in which the odd words
appear in lower case:
     (56) DR.- YOU sound CONFUSED.
     (57) DR.- DID YOU hear MY LAST QUESTION?
     (58) DR.- WOULD YOU come in AND sit down PLEASE?
     (59) DR.- CAN YOU say WHO?
     (60) DR.- I WILL see YOU AGAIN TOMORROW.

MISUNDERSTANDING

     It is not fully recognized by students of language how often
people misunderstand one another in conversation and yet their
dialogues proceed as if understanding and being understood had taken
place. The classic story involves three partially deaf men cycling
through the English countryside:
     FIRST - "WHAT TOWN IS THIS?"
     SECOND - "THURSDAY."
     THIRD - "ME TOO. LETS STOP AND HAVE A DRINK."
Sometimes a psychiatric interviewer realizes when misunderstanding
occurs and tries to correct it. Other times he simply passes it by.
It is characteristic of the paranoid mode to respond
idiosyncratically to particular word-concepts regardless of what the
interviewer is saying:
     (FIND GOOD EXAMPLE)
34600
34700 UNUNDERSTANDING
34800 A dialogue algorithm must be prepared for situations
34900 in which it simply does not understand i.e. it cannot arrive at any
35000 interpretation as to what the interviewer is saying. An algorithm should
35100 not be faulted for a lack of facts as in:
35200 (61) DR.- WHO IS THE PRESIDENT OF TURKEY?
35300 wherin the memory does not contain the words "president" and "Turkey".
35400 In this default condition it is simplest to reply:
35500 (62) PT.- I DONT KNOW.
35600 and dangerous to reply:
35700 (63) PT.- COULD YOU REPHRASE THE QUESTION?
35800 because of the horrible loops which can result.
35900 Since the main problem in the default condition of ununderstanding
36000 is how to continue, heuristics can be employed such as asking about the
36100 interviewer's intention as in:
36200 (64) PT.- WHY DO YOU WANT TO KNOW THAT?
36300 or rigidly continuing with a previous topic or introducing a new topic.
36400 These are admittedly desperate measures intended to prompt
36500 the interviewer in directions the algorithm has a better chance of understanding.
36600 Usually it is the interviewer who controls the flow from topic to
36700 topic but there are times, hopefully few, when control must be assumed
36800 by the algorithm.
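The default condition can be sketched as a safe answer followed by a rotation through the desperation heuristics, never the dangerous rephrase request of (63). The ordering and wording of the strategies are our own illustration.

```python
# Sketch of the ununderstanding default: rotate through safe
# strategies rather than ask for a rephrase (which can loop).
STRATEGIES = ["I DONT KNOW.",                     # simplest safe reply, as (62)
              "WHY DO YOU WANT TO KNOW THAT?",    # probe intention, as (64)
              "LETS TALK ABOUT THE TRACK."]       # rigidly introduce a topic

class Ununderstanding:
    def __init__(self):
        self.turn = 0

    def reply(self):
        answer = STRATEGIES[self.turn % len(STRATEGIES)]
        self.turn += 1
        return answer
```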